A Very Very Large Corpus Doesn't Always Yield Reliable Estimates
Authors
Abstract
Banko and Brill (2001) suggested that developing very large training corpora may advance empirical Natural Language Processing more effectively than improving methods that use existing, smaller training corpora. This work tests their claim by exploring whether a very large corpus can eliminate the sparseness problems associated with estimating unigram probabilities. We do this by empirically investigating the convergence behaviour of unigram probability estimates on a one billion word corpus. As expected, with one billion words many of our estimates converge to their eventual value. However, for some words no such convergence occurs. This leads us to conclude that relying on large corpora alone is not sufficient: we must pay attention to the statistical modelling as well.
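To make the idea of "convergence behaviour of unigram probability estimates" concrete, here is a minimal sketch. It is not the authors' code or experimental setup. It assumes a plain relative-frequency (maximum likelihood) estimate, count(word) / tokens seen, recorded at a few corpus sizes. The function name `running_unigram_estimates`, the toy corpus, and the checkpoint values are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple


def running_unigram_estimates(
    tokens: Iterable[str], word: str, checkpoints: List[int]
) -> List[Tuple[int, float]]:
    """Return (tokens_seen, count(word) / tokens_seen) at each checkpoint.

    Assumption: a simple relative-frequency estimate; the abstract does not
    specify the estimator, so this is illustrative only.
    """
    remaining = sorted(set(checkpoints))  # unique, ascending corpus sizes
    estimates: List[Tuple[int, float]] = []
    count = 0
    idx = 0
    for seen, token in enumerate(tokens, start=1):
        if token == word:
            count += 1
        # Record the estimate whenever the corpus reaches a checkpoint size.
        if idx < len(remaining) and seen == remaining[idx]:
            estimates.append((seen, count / seen))
            idx += 1
    return estimates


# Hypothetical toy corpus of 13 tokens; a real study would stream a far
# larger corpus from disk.
toy_corpus = "the cat sat on the mat and the dog sat on the rug".split()
trace = running_unigram_estimates(toy_corpus, "the", [4, 8, 13])
for size, estimate in trace:
    print(size, round(estimate, 4))
```

Plotting such a trace against corpus size for each word is one way to see whether an estimate settles toward a stable value or keeps drifting as more text is added.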
Similar resources
How textbooks (and learners) get it wrong: A corpus study of modal auxiliary verbs
Many elements contribute to the relative difficulty of acquiring specific aspects of English as a foreign language (Goldschneider & DeKeyser, 2001). Modal auxiliary verbs (e.g. could, might) are examples of a structure that is difficult for many learners. Not only are they particularly complex semantically, but especially in the Malaysian context ...
Reverse Peroneal Artery Flap for Large Heel and Sole Defects: A Reliable Coverage
BACKGROUND: Large soft tissue defects of the ankle and foot have always been challenging to reconstruct. Reverse sural flaps and free flaps have been used for this problem with variable success. The reverse peroneal artery flap is a reliable option that does not require microvascular repair. Connections of peroneal artery around talus and ankle joint are deep and reliable with anterior tibial and post...
A Modification on Applied Element Method for Linear Analysis of Structures in the Range of Small and Large Deformations Based on Energy Concept
In this paper, the formulation of a modified applied element method for linear analysis of structures in the range of small and large deformations is expressed. To calculate deformations in the structure, the minimum total potential energy principle is used. This method estimates the linear behavior of the structure in the range of small and large deformations, with a very good accuracy and low...
Anchor-Free Correlated Topic Modeling: Identifiability and Algorithm
In topic modeling, many algorithms that guarantee identifiability of the topics have been developed under the premise that there exist anchor words – i.e., words that only appear (with positive probability) in one topic. Follow-up work has resorted to three or higher-order statistics of the data corpus to relax the anchor word assumption. Reliable estimates of higher-order statistics are hard t...
Data Mining at the Intersection of Psychology and Linguistics
Large data resources play an increasingly important role in both linguistics and psycholinguistics. The first data resources used by both psychologists and linguists alike were word frequency lists such as Thorndike and Lorge (1944) and Kučera and Francis (1967). Although the Brown corpus on which the frequency counts of Kučera and Francis were based was very large for its time, comprising some...
Publication date: 2002